Matching pursuit approach to sparse Gaussian process regression

ABSTRACT

A system and method is provided for supervised learning. A training set is provided to the system. The system selects a training element from the provided training set, and adds the training element to a basis element set I. The system conducts an optimization test on the basis element set I with the selected training element to produce a selection score. The system determines whether the selection score indicates an improvement in optimization for the basis element set I. The system discards the selected element if the selection score does not indicate an improvement, and keeps the selected element if the selection score does indicate improvement. The process may then be repeated for other training elements until either the specified maximum number of basis functions is reached or improvement in optimization is below a threshold. At that point, the chosen set I should represent an optimized basis set.

RELATED APPLICATIONS

This application claims priority to U.S. Ser. No. 60/685,660, filed onMay 26, 2005 entitled “A MATCHING PURSUIT APPROACH TO SPARSE GAUSSIANPROCESS REGRESSION.”

FIELD OF THE INVENTION

The invention is a matching pursuit method for sparse gaussian process(GP) regression. Specifically, the invention provides a new basisselection system and method for building sparse GP regression modelsthat provides gains in accuracy as well as efficiency over previousmethods.

BACKGROUND OF THE INVENTION

Many problems in information processing involve the selection or rankingof items in a large data set. For example, a search engine that locatesdocuments meeting a search query must often select items from a largeresult set of documents to display to the user, and often must rankthose documents based on their relevance or other criteria. Similarexercises of selecting a set of items from a larger set of possibleitems are undertaken in other fields such as weather pattern prediction,analyzing commercial markets, and the like. Some complex mathematicalmodels and processes have been developed to perform this analysis.

One such set of processes, Bayesian Gaussian processes, provide aprobabilistic kernel approach to supervised learning tasks. Theadvantage of Gaussian process (GP) models over non-Bayesian kernelmethods, such as support vector machines, comes from the explicitprobabilistic formulation that yields predictive distributions for testinstances and allows standard Bayesian techniques for model selection.The cost of training GP models is O(n³), where n is the number oftraining instances, which results in a huge computational cost for largedata sets. Furthermore, when predicting a test case, a GP model requiresO(n) cost for computing the mean and O(n²) cost for computing thevariance. These heavy scaling properties obstruct the use of GPs inlarge scale problems.

Sparse GP models bring down the complexity of training as well astesting. The Nystrom method has been applied to calculate a reduced rankapproximation of the original n×n kernel matrix. One on-line algorithmmaintains a sparse representation of the GP models. Another algorithmuses a forward selection scheme to approximate the log posteriorprobability. Another fast and greedy selection method builds sparse GPregression models. All of these attempt to select an informative subsetof the training instances for the predictive model. This subset isusually referred to as the set of basis vectors, denoted as I. Themaximal size of I is usually limited by a value d_(max). As d_(max)<<n,the sparseness greatly alleviates the computational burden in bothtraining and prediction of the GP models. The performance of theresulting sparse GP models depends on the criterion used in the basisvector selection.

It would be desirable to provide a system and method for greedy forwardselection for sparse GP models. Accordingly, there's a need for a systemand method that yield better generalization performance, whileessentially not effecting algorithm complexity. The preferredembodiments of the system and method described herein clearly addressthis and other needs.

BRIEF SUMMARY OF THE INVENTION

In a preferred embodiment, a system and method is provided forsupervised learning. A training set is provided to the system. Thesystem selects a training element from the provided training set, andadds the training element to a set of basis elements I (basis elementset). The system conducts an optimization test on the basis element setI with the selected training element to produce a selection score. Thesystem determines whether the selection score indicates an improvementin optimization for the basis element set I. The system discards theselected element if the selection score does not indicate animprovement, and keeps the selected element if the selection score doesindicate improvement. The process may then be repeated for othertraining elements until either the specified maximum number of basisfunctions is reached or improvement in optimization is below athreshold. At that point, the chosen set I should represent an optimizedbasis set.

In one preferred embodiment, the selected element is selected at random.In another preferred embodiment, the system adds, in succession, aplurality of selected elements. In this embodiment, the scoring isperformed on the basis element set I every t^(th) addition of a selectedelement to the set of basis element set I, with t comprising a positiveinteger. This selection score is then used to decide whether to keep ordiscard the added selected elements.

In another preferred embodiment, the process of scoring the basiselement comprises a sparse Gaussian process.

In another preferred embodiment, the process of determining theselection score comprises using post-backfitting.

In another preferred embodiment, the training set is for categorizingindexed web pages for a search engine.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating components of a search engine inwhich one embodiment operates;

FIG. 2 is an example of a result set for the search engine of FIG. 1;

FIG. 3 is a flow diagram illustrating steps performed by the systemaccording to one embodiment associated with search engine relevanceranking;

FIG. 4 is a flow diagram illustrating further steps performed by oneembodiment of GP basis selection;

FIG. 5 is a flow diagram illustrating steps performed in a cachingalgorithm according to another embodiment; and

FIG. 6 is a flow diagram illustrating the steps performed in a modeladaptation algorithm according to another embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A preferred embodiment of a matching pursuit approach to sparse gaussianprocess regression system, constructed in accordance with the claimedinvention, is directed towards a criterion of greedy forward selectionfor sparse GP models. The criterion used is more efficient than previousmethods, while not affecting algorithm complexity. While the method isdescribed as pertaining to regression, the method is also applicable toother supervised learning tasks.

In one embodiment, as an example, and not by way of limitation, animprovement in Internet search engine categorization and scoring of webpages is provided. The World Wide Web is a distributed databasecomprising billions of data records accessible through the Internet.Search engines are commonly used to search the information available oncomputer networks, such as the World Wide Web, to enable users to locatedata records of interest. A search engine system 100 is shown in FIG. 1.Web pages, hypertext documents, and other data records from a source101, accessible via the Internet or other network, are collected by acrawler 102. The crawler 102 collects data records from the source 101.For example, in one embodiment, the crawler 102 follows hyperlinks in acollected hypertext document to collect other data records. The datarecords retrieved by crawler 102 are stored in a database 108.Thereafter, these data records are indexed by an indexer 104. Indexer104 builds a searchable index of the documents in database 108. Commonprior art methods for indexing may include inverted files, vectorspaces, suffix structures, and hybrids thereof. For example, each webpage may be broken down into words and respective locations of each wordon the page. The pages are then indexed by the words and theirrespective locations. A primary index of the whole database 108 is thenbroken down into a plurality of sub-indices and each sub-index is sentto a search node in a search node cluster 106.

To use search engine 100, a user 112 typically enters one or more searchterms or keywords, which are sent to a dispatcher 110. Dispatcher 110compiles a list of search nodes in cluster 106 to execute the query andforwards the query to those selected search nodes. The search nodes insearch node cluster 106 search respective parts of the primary indexproduced by indexer 104 and return sorted search results along with adocument identifier and a score to dispatcher 110. Dispatcher 110 mergesthe received results to produce a final result set displayed to user 112sorted by relevance scores. The relevance score is a function of thequery itself and the type of document produced. Factors that affect therelevance score may include: a static relevance score for the documentsuch as link cardinality and page quality, placement of the search termsin the document, such as titles, metadata, and document web address,document rank, such as a number of external data records referring tothe document and the “level” of the data records, and documentstatistics such as query term frequency in the document, global termfrequency, and term distances within the document. For example, TermFrequency Inverse Document Frequency (TFIDF) is a statistical techniquethat is suitable for evaluating how important a word is to a document.The importance increases proportionally to the number of times a wordappears in the document but is offset by how common the word is in allof the documents in the collection.

Referring to FIG. 2, there is shown an example of a result set 120. Asshown therein, in response to a query 122 for the search term “MP3player” shown on the top of the figure, the search engine YAHOO!searched its web index and produced a plurality of results in the formof result set 120 displayed to a user. For brevity, only a first page ofresult set 120 is shown. Result set 120 includes ten results 124 a-f,each with a respective clickable hyperlink 126 a-j, description 127 a-j,and Internet addresses or uniform resource locator (URL) 128 a-j fordata records that satisfy query 122.

Usually the number of web pages in the result set is very large,sometimes even as large as a million. It is important to ensure that thedocuments displayed to the user are ordered according to relevance, withthe most relevant displayed at the top. In one embodiment, for a givenquery-webpage pair, the Gaussian process (GP) nonlinear regressionmethod described herein is used as a relevance scoring function.

With reference to FIG. 3, a flow diagram illustrates the steps performedaccording to one embodiment. In step 300, a small set of random queries(e.g., a few 1000) are selected. In step 302, each query is processed,and a few hundred web pages are selected from the results for eachquery. Each web page forms a query-webpage pair associated with thequery by which the web page was selected, step 304. An editorial team,or some other system, provides relevance scores, for example, valuesfrom 0-100, step 305. These are the scores that form the output trainingvalues (y) that the GP emulates. The number of example pairs are severalhundreds of thousands, and a normal GP cannot handle this number withany efficiency. The sparse GP method described herein is then used toselect a basis set, step 308 which is a special subset of thequery-webpage pairs. In step 310, the final GP method is formed usingthe selected basis set.

According to one embodiment, the input vector (x) associated with aquery-webpage pair is set to be selected features that help inprediction. By way of example, and not by way of limitation, thosefeatures include: query term occurrence frequency in the web page, webpage length, eigenrank of the web page, spam index of the web page,first occurrence location of the query terms, family friendly rating ofthe web page, proximity of occurrences of the query terms in the webpage, and the like. The relevance score is the output variable, y thatis to be predicted.

In one embodiment, the method is implemented in the indexer/categoriser104 as a supervised learning task. One of the advantages of the system100 is that it results in faster categorization by using smaller andmore accurate basis sets. In one embodiment, the method is implementedin software on a server or personal computer. However, those skilled inthe art would recognize that the method can be implemented on anycombination of hardware or software platforms that can be programmed toimplement the methods described herein. Further, the method is notlimited to use in search engine technology, but is generally useful inany kind of training or regression problem.

In regression problems, a training data set is provided composed of nsamples. Each sample is a pair having an input vector x_(i)ε

^(m) and its corresponding target x_(i)ε

.

is the set of reals and

^(m) is the m dimensional real vector space. The true function value atx_(i) is represented as an unobservable latent variable f(x_(i)) and thetarget y_(i) is a noisy measurement of f(x_(i)). The goal is toconstruct a predictive model that estimates the relationship x

f(x).

In standard GPs for regression, the latent variables {f(x_(i))} arerandom variables in a zero mean Gaussian process indexed by {x_(i)}. Theprior distribution of {f(x_(i))} is a multivariate joint Gaussian,denoted asP(f)=N(f;0,K)where f=[f(x₁), . . . , f(x_(n))]^(T) and K is the n×n covariance matrixwhose ij-th element is K(x_(i), x_(j)), K being the kernel function. Pdenotes the probability function. The likelihood is essentially a modelof the measurement noise, which is conventionally evaluated as a productof independent Gaussian noises,P(y|f)=N(y;f,σ ² I)where y=[y₁, . . . , y_(n)]^(T) and σ² is the noise variance. N(f, μ,σ²) denotes the normal density function with mean μ and the variance σ².The posterior distribution P(f|y)∝P(y|f)P(f) is also exactly a Gaussian:P(f|y)=N(f;Kα*;σ ² K(K+σ ² I)⁻¹)  (1)where α*=(K+σ²I)⁻¹y. For any test instance x, the predictive model isrepresented by the predictive distributionP(f(x)|y)=∫P(f(x)|f)P(f|y)df=N(f(x);μ_(x),σ_(x) ²)whereμ_(x) =k ^(T)(K+σ ² I)⁻¹ y=k ^(T)α*  (2)σ_(x) ²=K(x,x)−k ^(T)(K+σ ² I)⁻¹ k  (3)and k=[K(x₁, x), . . . , K(x_(n), x)]^(T). The computational cost oftraining is O(n³), which mainly comes from the need to invert the matrix(K+σ²I) and obtain the vector α*. For predicting a test instance thecost is O(n) to compute the mean (2) and O(n²) for computing thevariance (3). This heavy scaling with respect to n makes the use ofstandard GP computationally prohibitive on large datasets.

As opposed to other methods for working with a reduced number of latentvariables, instead of assuming n latent variables for all the traininginstances, sparse GP models assume only d latent variables placed atsome chosen basis vectors {{tilde over (x)}_(i)}, denoted as a columnvector f_(I)=[f({tilde over (x)}₁), . . . , f({tilde over(x)}_(d))]^(T). The prior distribution of the sparse GP is a jointGaussian over f_(I) only, i.e.,P(f _(I))=N(f _(I);0,K _(I))  (4)where K_(I) is the d×d covariance matrix of the basis vectors whoseij-th element is K({tilde over (x)}_(i), {tilde over (x)}_(j))

These latent variables are then projected to all the training instances.The conditional mean at the training instances is K_(I,) ^(T).K_(I)⁻¹f_(I:), where K_(I,). is a d×n matrix of the covariance functionsbetween the basis vectors and all the training instances. The likelihoodcan be evaluated by these projected latent variables as followsP(y|f _(I))=N(y;K _(I,) ^(T) .K _(I) ⁻¹ f _(I),σ² I)  (5)The posterior distribution can then be written asP(f _(I)|y)=N(f _(I) ;K _(I)α_(I)*,σ² K _(I)(σ² K _(I) +K _(I) +K _(I,).K _(I,) ^(T).)⁻¹ K _(I))  (6)where α_(I)*=(σ²K_(I)+K_(I,).K_(I,) ^(T).)⁻¹K_(I,).y.

The predictive distribution at any test instance x is given byP(f(x)|y)=∫P(f(x)|f _(I))P(f _(I)|y)df _(I)=N(f(x);{tilde over(μ)}_(x),{tilde over (σ)}_(x) ²)where{tilde over (μ)}_(x) ={tilde over (k)} ^(T)α_(I)*  (7){tilde over (σ)}_(x) ²=K(x,x)−{tilde over (k)} ^(T) K _(I) ⁻¹ {tildeover (k)} ^(T)+σ² {tilde over (k)} ^(T)(σ² K _(I) +K _(I,) .K _(I,)^(T).)⁻¹ {tilde over (k)}  (8)and {tilde over (k)} is a column vector of the covariance functionsbetween the basis vectors and the test instance x, i.e. {tilde over(k)}=[K({tilde over (x)}₁, x), . . . , K({tilde over (x)}_(d), x)]^(T).

In large scale regression methods, the maximal size of the basis vectorsis limited by a relatively small value d_(max) that is much less thanthe number of the training instances n. A value for d_(max) can bedecided based on the CPU time that is available during training and/ortesting.

While the cost of training the full GP model is O(n³), the trainingcomplexity of sparse GP models is only O(nd_(max) ²). This correspondsto the cost of forming K_(I) ⁻¹, (σ²K_(I)+K_(I,).K_(I,) ^(T).)⁻¹ andα_(I)*. Thus, if d_(max) is not big, learning on large datasets can beaccomplished via sparse GP models. Also, for these sparse models,prediction for each test instance costs O(d_(max)) for the mean (7) andO(d_(max) ²) for the variance (8).

The mean of the posterior distribution is the maximum a posteriori (MAP)estimate, and it is possible to give an equivalent parametricrepresentation of the latent variables as f=Kα where α=[α₁, . . . ,α_(n)]^(T). The MAP estimate of the full GP is equivalent to minimizingthe negative logarithm of the posterior (1): $\begin{matrix}{{\min\limits_{\alpha}{{\pi(\alpha)}\text{:} = \frac{1}{2}{\alpha^{T}\left( {{\sigma^{2}K} + {K^{T}K}} \right)}\alpha}} - {y^{T}K\quad\alpha}} & (9)\end{matrix}$

Similarly, using f_(I)=K_(I)α_(I) for sparse GP models, the MAP estimateof the sparse GP is equivalent to minimizing the negative logarithm of(equation 6 above): $\begin{matrix}{{\min\limits_{\alpha_{I}}{{\overset{\sim}{\pi}\left( \alpha_{I} \right)}\text{:}}} = {{\frac{1}{2}{\alpha_{I}^{T}\left( {{\sigma^{2}K_{I}} + {K_{I,}.K_{I,}^{T}.}} \right)}\alpha_{I}} - {y^{T}{K_{I,}^{T}.\alpha_{I}}}}} & (10)\end{matrix}$

Suppose α in equation 9 is composed of two parts, i.e. α=[α_(I); α_(R)]where I denotes the set of basis vectors and R denotes the remaininginstances. The optimization problem of equation 10 is the same asminimizing π(α) in equation 9 using α_(I) only, namely, with theconstraint, α_(R)=0. In other words, the basis vectors of the sparse GPscan be selected to minimize the negative log-posterior of the full GPs,π(α) defined as in equation 9.

In the sparse GP approach described above, the choice of I, the set ofbasis vectors, can make a difference. Generally the basis vectors can beplaced anywhere in the input space

^(m). Since training instances usually cover the input space of interestadequately, in some methods, basis vectors are selected from just theset of training instances. For a given problem, d_(max) is chosen to beas large as possible subject to constraints on computational time intraining and/or testing. A basis selection method is then used to find Iof size d_(max).

One cheap method (in terms of processing time requirements) is to selectthe basis vectors at random from the training data set. But, such achoice may not work well when d_(max) is much smaller than n. In oneembodiment, an I is selected that makes the corresponding sparse GPapproximate well the posterior distribution of the full GP. In oneembodiment, the optimization formulation described above can be used. Inone embodiment, it is preferable to choose, among all subsets, I of sized_(max), the one that gives the best value of {tilde over (π)} in (10).Such a combinatoric search is expensive. In another embodiment, acheaper approach is to do greedy forward selection used in priormethods.

With regard to time complexities associated with forward selection,there are two costs involved. There is a basic cost associated withupdating of the sparse GP solution, given a sequence of chosen basisfunctions, referred to as T_(basic). This cost is the same for allforward selection methods, and is O(nd_(max) ²). Then, depending on thebasis selection method, there is the cost associated with basisselection. The accumulated value of this cost for choosing all d_(max)basis functions is referred to as T_(selection). Forward basis selectionmethods differ in the way they choose effective basis functions whilekeeping T_(selection) small. It is noted that the total cost associatedwith the random basis selection method mentioned earlier isT_(random)=T_(basic)=O(nd_(max) ²). This cost forms a baseline forcomparison.

One example is a typical situation in forward selection having a currentworking set I, and the next basis vector, x_(i), is to be selected. Onemethod evaluates each given x_(i)∉I by trying its complete inclusion,i.e., set I′=I∪{x_(i)} and optimize π(α) using α_(I′)=[(α_(I); α_(i)].Thus, the selection criterion for the instance x_(i)∉I is the decreasein π(α) that can be obtained by allowing both α_(I) and α_(i) asvariables to be non-zero. The minimal value of π(α) can be obtained bysolving min_(α) _(I′) {tilde over (π)}(α_(I′)) defined in equation 10.This costs O(nd) time for each candidate, x_(i), where d is the size ofthe current set, I. If all x_(i)∉I need to be tried, it will lead toO(n²d) cost. Accumulated till d_(max) basis functions are added, thisleads to a T_(selection) that has O(n²d_(max) ²) complexity, which isdisproportionately higher than T_(basic). Other methods have resorted toa randomized scheme by considering only κ basis elements randomly chosenfrom outside I during one basis selection. A value of κ=59 is normallyused, although other values can be used. For this randomized method, thecomplexity of T_(selection) is O(κnd_(max) ²). Although, from acomplexity viewpoint, T_(basic) and T_(selection) are comparable, itshould be noted that the overall cost of the method is about 60 timesthat of T_(random).

In one relatively cheap heuristic criterion for basis selection, the“informativeness” of an input vector x_(i)∉I is scored by theinformation gain $\begin{matrix}{D\left\lbrack {{{Q\left( {f_{I^{\prime}}\text{❘}y} \right)}\left. {P\left( {f_{I^{\prime}}❘y} \right)} \right\rbrack} = {\int{{Q\left( {f_{I^{\prime}}\text{❘}y} \right)}\log\frac{Q\left( {f_{I^{\prime}}\text{❘}y} \right)}{P\left( {f_{I^{\prime}}\text{❘}y} \right)}{\mathbb{d}f_{I^{\prime}}}}}} \right.} & (11)\end{matrix}$where I′ denotes the new set of basis vectors after including a newelement x_(i) into the current set I, and f_(I′) denotes the vector oflatent variables of I′. P(f_(I′)|y) is the true posterior distributionof f_(I′) defined as in equation 6. Q(f_(I′)|y)∝Q(y|f_(I′))P(f_(I′)) isa posterior approximation, in which the “pseudoinclusion” likelihood isdefined asQ(y|f _(I′))∝N(y^(\i);(K _(I,) ^(\i).)^(T) K _(I) ⁻¹ f _(I),σ² I_(n−1))N(y_(i) |f(x_(i)),σ²)  (12)where y^(\i) denotes the target vector after removing y_(i), K_(I,)^(\i). denotes the d×(n−1) covariance matrix after removing, fromK_(I,)., the column corresponding to x_(i) and f(x_(i)) denotes thelatent variable at y^(\i). In the proper inclusion as in equation 5, thelatent variable f(x_(i)) would be coupled with the targets y^(\i) in thelikelihood, whereas the approximation (equation 12) ignores thesedependencies. Due to this simplification, the score in equation 11 canbe computed in O(1) time, given the current predictive model representedby I. Thus, the scores of all instances outside I can be efficientlyevaluated in O(n) time, which makes this algorithm almost as fast asusing random selection. In one embodiment, the correlation in theremaining instances {x_(i): x_(i)∉I} is not used, as the criterion(equation 11) is evaluated upon the current model.

The two methods presented above are extremes in efficiency. In onemethod, T_(selection) is disproportionately larger than T_(basic) while,in the other method, T_(selection) is very much smaller than T_(basic).Now described is a moderate method that is effective and whosecomplexity is in between the two above described methods. This methoduses a kernel matching pursuit approach.

Kernel matching pursuit is a sparse method for ordinary least squaresthat includes two general greedy sparse approximation schemes, called“pre-backfitting” and “post-backfitting.” The presently described methodillustrates that both methods can be generalized to select the basisvectors for sparse GPs. This method is an efficient selection criterionthat is based on post-backfitting methodology.

With reference to FIG. 4, a flow diagram shows steps performed by thesystem 100 according to this method. Given the current I, the minimalvalue of π(α) when it is optimized using only α_(I) as variables isequivalent to min_(α) _(I) {tilde over (π)}(α_(I)) as in equation 10,step 400. The minimizer, denoted as α_(I)*, is given byα_(I)*=(σ² K _(I) +K _(I,) .K _(I,) ^(T).)⁻¹ K _(I,).y  (13)The scoring criterion for an instance x_(i)∉I is based on optimizingπ(α) by fixing α_(I)=α_(I)* and changing α_(i), only. On step 402, theone-dimensional minimizer can be found as $\begin{matrix}{\alpha_{i}^{*} = \frac{{K_{i,}^{T}.\left( {y - {K_{I,}^{T}.\alpha_{I}^{*}}} \right)} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{I}^{*}}}{{\sigma^{2}{K\left( {x_{i},x_{i}} \right)}} + {K_{i,}^{T}.K_{i,}.}}} & (14)\end{matrix}$where K_(i,). is the n×1 matrix of covariance functions between x_(i),and all the training data, and {tilde over (k)}_(i) is a d dimensionalvector having K(x_(j), x_(i)), x_(j)εI. In Step 404, the selection scoreof the instance x_(i), is the decrease in π(α) achieved by the onedimensional optimization of α_(i), which can be written in closed formas $\begin{matrix}{\Delta_{i} = {\frac{1}{2}{\alpha_{i}^{*}\left( {{K_{i,}^{T}.\left( {y - {K_{I,}^{T}.\alpha_{I}^{*}}} \right)} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{I}^{*}}} \right)}}} & (15)\end{matrix}$Note that a full kernel column K_(i,). is used and so it costs O(n) timeto compute (15). In contrast, for scoring one instance. Previous methodsrequire, for example, O(nd) time and or O(1) time.

In one embodiment, processing is performed on all x_(i)∉I, step 406, andthe instance which gives the largest decrease is selected, step 408.This uses O(n²) effort. In summing the cost till d_(max) basis vectorsare selected, there is an overall complexity of O(n²d_(max)) which ismuch higher than T_(basic).

In one embodiment, to restrict the overall complexity of T_(selection)to O(n²d_(max)), the method randomly selects basis vectors whichprovides a relatively good selection rather than the absolute bestselection. Since it costs only O(n) time to evaluate the selectioncriterion in equation 15 for one instance, the method can choose thenext basis vector from a set of dolt instances randomly selected fromoutside of I. Such a selection method keeps the overall complexity ofT_(selection) to O(nd_(max) ²). From a practical point of view thescheme can be expensive because the selection criterion (equation 15)preferably computes a full kernel row K_(I,). for each instance to beevaluated.

As kernel evaluations could be very expensive, in one embodiment, amodified method is used to keep the number of such evaluations small.FIG. 5 is a flow diagram illustrating steps performed by the system 100according to this embodiment. In this embodiment, a matrix cache ismaintained, C, of size c×n, that contains c rows of the full kernelmatrix K. At the beginning of the algorithm (when I is empty) C isinitialized by randomly choosing c training instances, computing thefull kernel row, K_(I,). for the chosen i's and putting them in the rowsof C, step 500.

Each step corresponding to a new basis vector selection proceeds asfollows. First Δ_(i) is computed for the c instances corresponding tothe rows of C, step 502, and the instance with the highest score isselected for inclusion in I, step 504. Let x_(j) denote the chosen basisvector. Then the remaining instances are sorted (that define C)according to their Δ_(i) values. Finally, the system selects r, freshinstances (from outside of I and the vectors that define C) to replacex_(j) and the κ−1 cached instances with the lowest score. Thus, in eachbasis selection step, the system computes the criterion scores for cinstances, but evaluates full kernel rows only for κ fresh instances,step 508.

One advantage of the above scheme is that, those basis elements whichhave very good scores, but are overtaken by another better element in aparticular step, continue to remain in C and, according to probability,get selected in future basis selection steps. The method uses κ=59. Thevalue of c can be set to be any integer between κ and d_(max). For any cin this range, the complexity of T_(selection) remains at mostO(nd_(max) ²).

The above cache method is unique to the point that it cannot be used inpreviously known methods without unduly increasing complexity. In oneembodiment, an extra cache is used for storing kernel rows of instanceswhich get discarded in one step, but which are to be considered again ina future step.

With respect to model adaptation, once d_(max) basis vectors areselected from the training instances according to a criterion, themarginal likelihood of the sparse GP model for model selection iscomputed. The sparse GP model is conditional on the parameters in thekernel function and the Gaussian noise level σ², which can be collectedtogether in θ, the hyperparameter vector. The likelihood approximationwith projected latent variables (equation 5), and equation 4, leads tothe following marginal likelihoodP(y❘θ) = ∫P(y❘f_(I))P(f_(I))𝕕f_(I) = N(y❘0, σ²I + K_(I,)^(T).K_(I)⁻¹K_(I,).)which is known as the evidence, a yardstick for model selection. Theoptimal values of θ can be inferred by maximizing the evidence.Equivalently, φ(θ)=−log P(y|θ) can be minimized. This can be performedby using gradient-based optimization techniques.

One of the problems in using gradient based methods is the dependence ofI on θ that makes φ a non-differentiable function. In one embodiment,with reference to FIG. 6, this issue is resolved by repeating the stepsshown therein. In step 600, I is fixed and a small gradient basedadaptation of θ is performed. At step 602, θ fixed and I is selected bythe basis selection algorithm. It is preferable to avoid a detailedminimization in this step since that may lead to a θ where basisselection may yield a set very different from I and hence give a verydifferent true value of the evidence at that θ. In one embodiment, forexample, at most, only two gradient searches are performed. For thecache-based post-backfitting method of basis selection, after step 604,the initial kernel cache, C, is set using the rows of K_(I,). at theprevious θ, step 604.

Because of the alternation with steps 600 and 602, in some instances,the gradients and I will not stabilize cleanly. Hence, smoothconvergence cannot be expected in the model adaptation process. In oneembodiment, to resolve this issue, a method is used for relativeimprovement on the evidence value in two iteration periods. The relativeimprovement at iteration t is defined as$\Lambda^{(t)} = \frac{\phi^{({\tau_{1},{t - \tau_{2}}})} - \phi^{({\tau_{1},t})}}{\phi^{({\tau_{1},{t - \tau_{2}}})}}$where τ₁ and τ₂ are lag parameters satisfying 0<τ₁≦τ₂, φ^((t)) is theminimal value of φ after the t-th major iteration, andφ^((τ,t))=min{φ^((t−τ+1)), . . . , φ^((t))}, step 608. In oneembodiment, it is preferable to use τ₁=τ₂=10, and to stop the adaptationprocedure when Λ^((t))≦0.001, step 610. In one embodiment, a minimum ofτ₁+τ₂ iterations are first performed before the convergence criterion isused.

In one embodiment, in forward selection of basis functions, thefollowing quantities are incrementally updated: the Choleskyfactorization LL^(T) of A=σ²K_(I)+K_(I,).K_(I,) ^(T)., K_(I,).β={tildeover (L)}⁻¹K_(I,) ^(T).y, α_(I)*={tilde over (L)}^(−T)β and μ=K_(I,)^(T).α_(I)*.

The negative logarithm of evidence can be computed as$\phi = {{\log{\overset{\sim}{L}}} - {\log{L}} + {\frac{n - d_{\max}}{2}\log\quad\sigma^{2}} + {\frac{1}{2\sigma^{2}}\left( {{y^{T}y} - {\beta^{T}\beta}} \right)}}$

The derivative of φ with respect to log σ² is$\frac{\partial\phi}{{\partial\log}\quad\sigma^{2}} = {\frac{1}{2}\left( {{\sigma^{2}{{tr}\left( {L^{T}A^{- 1}L} \right)}} + \left( {n - d_{\max}} \right) + {\alpha_{I}^{*T}K_{I}\alpha_{I}^{*}} - {\frac{1}{\sigma^{2}}\left( {{y^{T}y} - {\beta^{T}\beta}} \right)}} \right)}$

In the above, tr(L^(T)A⁻¹L) can be computed by first computing {tildeover (L)}⁻¹L and then computing the sum of squares of all elements ofthat matrix. The derivative with respect to a general hyperparameter, vin the kernel function is:$\frac{\partial\phi}{\partial\upsilon} = {{{- \frac{1}{2}}{{tr}\left( {K_{I}^{- 1} - {\sigma^{2}A^{- 1}}} \right)}\frac{\partial K_{I}}{\partial\upsilon}} + {\frac{1}{2}\alpha_{I}^{*T}\frac{\partial K_{I}}{\partial\upsilon}\alpha_{I}^{*}} + {{tr}\quad B^{T}\frac{{\partial K_{I,}}.}{\partial\upsilon}} - {\frac{1}{\sigma^{2}}\alpha_{I}^{*T}\frac{{\partial K_{I,}}.}{\partial\upsilon}\left( {y - \mu} \right)}}$where B=A⁻¹K_(I,). The i-th column, p_(i) of B can be computed bysolving two lower triangular systems: {tilde over (L)}z=q_(i), whereq_(i), is the i-th column of K_(I,) and then {tilde over(L)}^(T)p_(i)=z. Computation of B is one of the most expensivecomputations of the algorithm; it costs O(nd_(max) ²). Note that B andthe cache matrix C can occupy the same storage space. Also, note that$\frac{\partial K_{\mathcal{I}}}{\partial\upsilon}$can be obtained from$\frac{{\partial K_{\mathcal{I},}}.}{\partial\upsilon},$and $\frac{{\partial K_{\mathcal{I},}}.}{\partial\upsilon},$does not require storage space, as the kernel values in K_(I,). can beused to compute the elements of$\frac{{\partial K_{\mathcal{I},}}.}{\partial\upsilon}$directly.

Thus, a more efficient system and method for regression and othersupervised learning tasks has been described. While the system andmethod is generally described herein in the context of textcategorization, specifically with respect to training vectors forcategorizing indexed web pages, those skilled in the art would recognizethat the system and method can be applied to any problem or systemrequiring selection of training vectors or elements, such as in weatherpattern prediction, commercial market analysis, and the like.

Although the invention has been described in language specific tocomputer structural features, methodological acts, and by computerreadable media, it is to be understood that the invention defined in theappended claims is not necessarily limited to the specific structures,acts, or media described. Therefore, the specific structural features,acts and mediums are disclosed as exemplary embodiments implementing theclaimed invention.

Furthermore, the various embodiments described above are provided by wayof illustration only and should not be construed to limit the invention.Those skilled in the art will readily recognize various modificationsand changes that may be made to the claimed invention without followingthe example embodiments and applications illustrated and describedherein, and without departing from the true spirit and scope of theclaimed invention, which is set forth in the following claims.

1. A computerized method for supervised learning, comprising: receivinga set of training elements; selecting a basis element from the set oftraining elements; adding the basis element to a basis element set;conducting an optimization test on the basis element set with theselected basis element to produce a selection score; determining whetherthe selection score indicates an improvement in optimization for thebasis element set; and discarding the basis element if the selectionscore does not indicate an improvement.
 2. The method of claim 1,comprising retaining the basis element in the basis element set if theselection score indicates an improvement.
 3. The method of claim 1,comprising repeating the method for a plurality of additional basiselements.
 4. The method of claim 1, wherein selecting the basis elementcomprises selecting at random.
 5. The method of claim 1, whereinselecting the basis element comprises selecting manually.
 6. The methodof claim 1, comprising caching the basis element set and adding, insuccession, a plurality of basis elements.
 7. The method of claim 6,wherein the step of scoring is performed on the training set everyt^(th) addition of a basis element to the basis training set, wherein tcomprises a positive integer.
 8. The method of claim 1, wherein thetraining set comprises elements representing web pages located in asearch by a search engine.
 9. The method of claim 1, wherein the scoringcomprises using a sparse Gaussian process.
 10. The method of claim 1,wherein determining the selection score comprises usingpost-backfitting.
 11. The method of claim 9, comprising applying aminimizer algorithm.
 12. The method of claim 11, comprising applying thefollowing formula in a mimimizer algorithm:α_(I)*=(σ² K _(I) +K _(I,) .K _(I,) ^(T).)⁻¹ K _(I,).y where α_(I)* isthe minimizer, K_(I) is the d×d covariance matrix of the basis vectors,σ² is the noise variance, y is the target output vector.
 13. The methodof claim 12, wherein comprising applying the following formula to findthe minimizer:$\alpha_{i}^{*} = \frac{{K_{i}^{T}{\text{,}.\left( {y - {K_{\mathcal{I}}^{T}{\text{,}.\alpha_{\mathcal{I}}^{*}}}} \right)}} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{\mathcal{I}}^{*}}}{{\sigma^{2}{{??}\left( {x_{i},x_{i}} \right)}} + {K_{i}^{T}{\text{,}.K_{i}}{\text{,}.}}}$where K_(i), is the n×1 matrix of covariance functions between x_(i),and all the training data, and {tilde over (k)}_(i) is a d dimensionalvector having K(x_(j), x_(i)), x_(j)εI.
 14. The method of claim 13,comprising applying the following formula to find the selection score,Δ_(i):$\Delta_{i} = {\frac{1}{2}{\alpha_{i}^{*}\left( {{K_{i}^{T}{\text{,}.\left( {y - {K_{\mathcal{I}}^{T}{\text{,}.\alpha_{\mathcal{I}}^{*}}}} \right)}} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{\mathcal{I}}^{*}}} \right)}}$15. A system for supervised learning, comprising: a training elementset; a processor for selecting a basis element from the training elementset; the processor for adding the basis element to a basis element set;and the processor for conducting an optimization test on the basiselement set for optimization with the basis element to produce aselection score, and for determining whether the selection scoreindicates an improvement in optimization for basis element set, and fordiscarding the basis element if the selection score does not indicate animprovement.
 16. The system of claim 15, wherein the processor selectsthe basis element at random.
 17. The system of claim 15, wherein theprocessor is for retaining the selected element in the basis element setif the selection score indicated an improvement.
 18. The system of claim15, wherein the processor is for repeating the selection of a basiselement a plurality of times.
 19. The system of claim 15, wherein theprocessor is further for caching the basis element set, and adding insuccession, a plurality of basis elements.
 20. The system of claim 19,wherein the processor scores the basis element set every t^(th) additionof a basis element to the basis element set, t comprising a positiveinteger.
 21. The system of claim 15, wherein the training element setcomprises elements representing web pages located by a search engine.22. The system of claim 15, wherein the processor uses a sparse Gaussianprocess for scoring.
 23. The system of claim 15, wherein the processordetermines the selection score using post-backfitting.
 24. The system ofclaim 15, wherein the processor applies a minimizer algorithm.
 25. Thesystem of claim 24, wherein the processor applies the following formulain the mimimizer algorithm:α_(I)*=(σ² K _(I) +K _(I,) .K _(I,) ^(T).)⁻¹ L _(I,).y where α_(I)* is aminimizer, K_(I) is the d×d covariance matrix of the basis vectors, σ²is the noise variance, y is the target output vector.
 26. The system ofclaim 25, wherein the processor applies the following formula to findthe minimizer:$\alpha_{i}^{*} = \frac{{K_{i}^{t}{\text{,}.\left( {y - {K_{\mathcal{I}}^{T}{\text{,}.\alpha_{\mathcal{I}}^{*}}}} \right)}} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{\mathcal{I}}^{*}}}{{\sigma^{2}{{??}\left( {x_{i},x_{i}} \right)}} + {K_{i}^{T}{\text{,}.K_{i}}{\text{,}.}}}$where K_(i), is the n×1 matrix of covariance functions between x_(i),and all the training data, and {tilde over (k)}_(i) is a d dimensionalvector having K(x_(j), x_(i)), x_(j)εI.
 27. The system of claim 26,wherein the processor applies the following formula to find theselection score, Δ_(i):$\Delta_{i} = {\frac{1}{2}{\alpha_{i}^{*}\left( {{K_{i}^{T}{\text{,}.\left( {y - {K_{\mathcal{I}}^{T}{\text{,}.\alpha_{\mathcal{I}}^{*}}}} \right)}} - {\sigma^{2}{\overset{\sim}{k}}_{i}^{T}\alpha_{\mathcal{I}}^{*}}} \right)}}$28. A method of producing a set of web pages located by a search engineas the result of a search query, the method comprising: generating aninitial result set of web pages; selecting one or more web pages to addto the initial result set; selecting one or more of the web pages fromthe initial result as one or more basis elements; adding the one or morebasis elements to a basis element set; conducting an optimization teston the basis element set with the basis elements to produce a selectionscore; determining whether the selection score indicates an improvementin optimization for the initial result set; and discarding the selectedelement if the selection score does not indicate an improvement.
 29. Themethod of claim 28, comprising retaining the one or more basis elementsin the basis element set if the selection score indicates animprovement.
 30. The method of claim 28, comprising repeating the methodfor a plurality of additional basis elements.
 31. The method of claim28, wherein selecting the basis element comprises selecting at random.32. The method of claim 28, wherein selecting the basis elementcomprises selecting manually.
 33. The method of claim 28, comprisingcaching the basis element set and adding, in succession, a plurality offurther basis elements.
 34. The method of claim 33, wherein the step ofscoring is performed on the training set every t^(th) addition of one ormore basis elements to the training set, wherein t comprises a positiveinteger.
 35. The method of claim 28, wherein the scoring comprises usinga sparse Gaussian process.
 36. The method of claim 28, whereindetermining the selection score comprises using post-backfitting.